
Attempt to optimize LocalExecutor #144


Open

james7132 wants to merge 22 commits into master from optimized-local-executor

Conversation

@james7132 (Contributor) commented Jul 24, 2025

This PR makes LocalExecutor its own separate implementation instead of a wrapper around Executor, forking its own State type that uses unsynchronized !Send types where possible: Arc -> Rc, Mutex -> RefCell, AtomicPtr -> Cell<*mut T>, ConcurrentQueue -> VecDeque. This implementation also removes the extra operations that assume there are other concurrent Runners/Tickers (i.e. local queues, extra notifications).
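As a rough illustration of that substitution, here is a minimal sketch with assumed field names (the real State in this PR holds more, e.g. waker bookkeeping):

```rust
use std::cell::{Cell, RefCell};
use std::collections::VecDeque;
use std::rc::Rc;
use std::task::Waker;

use async_task::Runnable;
use slab::Slab;

// !Send by construction: Rc, RefCell, and Cell all make this type !Sync and
// !Send, so the compiler rules out cross-thread use instead of a Mutex.
struct State {
    queue: RefCell<VecDeque<Runnable>>, // was ConcurrentQueue<Runnable>
    active: RefCell<Slab<Waker>>,       // was Mutex<Slab<Waker>>
    notified: Cell<bool>,               // was an atomic flag
}

type LocalState = Rc<State>; // was Arc<State>
```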

For testing, I've duplicated most of the single-thread-compatible Executor tests to ensure there's equivalent coverage for the new independent LocalExecutor.

I've also made an attempt to write this with UnsafeCell instead of RefCell, but it litters the code with a huge amount of unsafe, which might be too much for this crate. The gains there might not be worth it.

I previously wrote some additional benchmarks but lost the changes to a stray git reset --hard while benchmarking.
The gains here are substantial. Here are the results:

single_thread/local_executor::spawn_one
                        time:   [130.05 ns 130.98 ns 132.16 ns]
                        change: [-80.586% -80.413% -80.214%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe
single_thread/local_executor::spawn_batch
                        time:   [17.195 µs 22.167 µs 32.281 µs]
                        change: [-30.007% +0.4082% +41.137%] (p = 0.98 > 0.05)
                        No change in performance detected.
Found 17 outliers among 100 measurements (17.00%)
  7 (7.00%) high mild
  10 (10.00%) high severe
Benchmarking single_thread/local_executor::spawn_many_local: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 856.5s, or reduce sample count to 10.
single_thread/local_executor::spawn_many_local
                        time:   [2.7766 ms 2.8091 ms 2.8453 ms]
                        change: [-21.653% -8.5454% +6.4249%] (p = 0.30 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe
single_thread/local_executor::spawn_recursively
                        time:   [19.618 ms 20.071 ms 20.558 ms]
                        change: [-24.237% -22.461% -20.762%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild
Benchmarking single_thread/local_executor::yield_now: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.7s, enable flat sampling, or reduce sample count to 50.
single_thread/local_executor::yield_now
                        time:   [1.8382 ms 1.8421 ms 1.8470 ms]
                        change: [-52.888% -52.744% -52.570%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) high mild
  9 (9.00%) high severe
single_thread/local_executor::channels
                        time:   [9.9259 ms 9.9355 ms 9.9452 ms]
                        change: [-15.797% -15.656% -15.533%] (p = 0.00 < 0.05)
                        Performance has improved.
single_thread/local_executor::web_server
                        time:   [54.839 µs 56.776 µs 59.062 µs]
                        change: [-23.926% -18.725% -13.009%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

@james7132 (Contributor, Author)

Hmmm, VecDeque::new is not const until Rust 1.68. It could be trivially replaced with LinkedList which has a const constructor in 1.63, but it would be a major regression in performance.

@taiki-e (Collaborator) commented Jul 25, 2025

There is no need to worry about the MSRV of const VecDeque::new; we will be able to raise the MSRV in about two weeks: smol-rs/smol#244 (comment)

@james7132 (Contributor, Author)

OK, I did a more comprehensive benchmark run, and it's pretty clear that the UnsafeCell version is tangibly faster:

group                                                     local-executor-opt                      local-executor-opt-w-unsafe-cell        master
-----                                                     ------------------                      --------------------------------        ------
single_thread/local_executor::channels                    1.18     12.2±0.69ms        ? ?/sec     1.00     10.3±0.08ms        ? ?/sec     1.41     14.6±0.93ms        ? ?/sec
single_thread/local_executor::spawn_many_local            1.11      2.7±0.20ms        ? ?/sec     1.00      2.4±0.15ms        ? ?/sec     1.25      3.0±0.20ms        ? ?/sec
single_thread/local_executor::spawn_one                   1.15    155.7±7.89ns        ? ?/sec     1.00   135.0±14.68ns        ? ?/sec     6.08   821.0±86.13ns        ? ?/sec
single_thread/local_executor::spawn_recursively           1.01     27.4±2.17ms        ? ?/sec     1.00     27.1±2.79ms        ? ?/sec     1.17     31.7±2.56ms        ? ?/sec
single_thread/local_executor::web_server                  1.25     20.2±1.63ms        ? ?/sec     1.00     16.2±0.10ms        ? ?/sec     1.60     25.9±1.69ms        ? ?/sec
single_thread/local_executor::yield_now                   1.18      2.3±0.12ms        ? ?/sec     1.00  1981.4±166.43µs        ? ?/sec    2.35      4.7±0.21ms        ? ?/sec
single_thread/static_local_executor::channels             1.38     13.8±1.80ms        ? ?/sec     1.00     10.0±0.13ms        ? ?/sec     1.63     16.3±1.70ms        ? ?/sec
single_thread/static_local_executor::spawn_many_local     1.20  1770.7±119.07µs        ? ?/sec    1.00  1474.5±28.53µs        ? ?/sec     1.70      2.5±0.11ms        ? ?/sec
single_thread/static_local_executor::spawn_one            1.49    154.8±8.22ns        ? ?/sec     1.00    104.0±6.94ns        ? ?/sec     7.28   756.9±66.28ns        ? ?/sec
single_thread/static_local_executor::spawn_recursively    1.53     26.4±2.18ms        ? ?/sec     1.00     17.3±1.05ms        ? ?/sec     1.61     27.8±3.10ms        ? ?/sec
single_thread/static_local_executor::web_server           1.19     20.0±1.05ms        ? ?/sec     1.00     16.7±0.32ms        ? ?/sec     1.71     28.6±3.43ms        ? ?/sec
single_thread/static_local_executor::yield_now            1.29      2.5±0.22ms        ? ?/sec     1.00  1916.6±165.43µs        ? ?/sec    2.54      4.9±0.19ms        ? ?/sec
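The shape of the RefCell -> UnsafeCell swap, as a hedged sketch (illustrative names, and the actual change covers more state than just the queue): each borrow-flag check is replaced by a raw dereference whose safety argument is that the executor is single-threaded and never re-enters these methods.

```rust
use std::cell::UnsafeCell;
use std::collections::VecDeque;

use async_task::Runnable;

struct Queue {
    inner: UnsafeCell<VecDeque<Runnable>>,
}

impl Queue {
    fn push(&self, runnable: Runnable) {
        // SAFETY: the executor is !Send and this is never called reentrantly,
        // so this is the only live reference to the VecDeque.
        unsafe { (*self.inner.get()).push_back(runnable) }
    }

    fn pop(&self) -> Option<Runnable> {
        // SAFETY: as above; the reference ends before any user code runs.
        unsafe { (*self.inner.get()).pop_front() }
    }
}
```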

This PR should be ready for review now.

@james7132 james7132 marked this pull request as ready for review August 9, 2025 05:11
@james7132 james7132 requested review from notgull and taiki-e August 9, 2025 05:26
@james7132 (Contributor, Author) commented Aug 9, 2025

I don't think we can reasonably make LocalExecutor any faster with its current interface. Anything more would probably require std LocalWaker support, a customized LocalTask<T> in async_task, or stack allocation of some kind.

@james7132 force-pushed the optimized-local-executor branch from d206459 to 13fbe6f on August 9, 2025 06:01
@james7132 (Contributor, Author)

Spoke too soon: I added an early-out to LocalExecutor::run for when the queue is empty and nabbed some additional performance improvements:

group                                                     local-executor-opt-w-unsafe-cell        local-executor-opt-w-unsafe-cell-and-early-out-run    master
-----                                                     --------------------------------        --------------------------------------------------    ------
single_thread/local_executor::channels                    1.04     10.3±0.08ms        ? ?/sec     1.00      9.9±0.15ms        ? ?/sec                   1.47     14.6±0.93ms        ? ?/sec
single_thread/local_executor::spawn_many_local            1.42      2.4±0.15ms        ? ?/sec     1.00  1703.3±37.62µs        ? ?/sec                   1.78      3.0±0.20ms        ? ?/sec
single_thread/local_executor::spawn_one                   1.44   135.0±14.68ns        ? ?/sec     1.00     94.1±2.81ns        ? ?/sec                   8.73   821.0±86.13ns        ? ?/sec
single_thread/local_executor::spawn_recursively           1.58     27.1±2.79ms        ? ?/sec     1.00     17.2±0.84ms        ? ?/sec                   1.85     31.7±2.56ms        ? ?/sec
single_thread/local_executor::web_server                  1.00     16.2±0.10ms        ? ?/sec     1.00     16.1±0.14ms        ? ?/sec                   1.60     25.9±1.69ms        ? ?/sec
single_thread/local_executor::yield_now                   1.19  1981.4±166.43µs        ? ?/sec    1.00  1670.0±18.72µs        ? ?/sec                   2.79      4.7±0.21ms        ? ?/sec
single_thread/static_local_executor::channels             1.03     10.0±0.13ms        ? ?/sec     1.00      9.8±0.06ms        ? ?/sec                   1.67     16.3±1.70ms        ? ?/sec
single_thread/static_local_executor::spawn_many_local     1.11  1474.5±28.53µs        ? ?/sec     1.00  1323.9±83.38µs        ? ?/sec                   1.90      2.5±0.11ms        ? ?/sec
single_thread/static_local_executor::spawn_one            1.38    104.0±6.94ns        ? ?/sec     1.00     75.2±1.06ns        ? ?/sec                   10.07  756.9±66.28ns        ? ?/sec
single_thread/static_local_executor::spawn_recursively    1.20     17.3±1.05ms        ? ?/sec     1.00     14.4±0.85ms        ? ?/sec                   1.93     27.8±3.10ms        ? ?/sec
single_thread/static_local_executor::web_server           1.03     16.7±0.32ms        ? ?/sec     1.00     16.3±0.10ms        ? ?/sec                   1.76     28.6±3.43ms        ? ?/sec
single_thread/static_local_executor::yield_now            1.25  1916.6±165.43µs        ? ?/sec    1.00  1530.8±30.15µs        ? ?/sec                   3.18      4.9±0.19ms        ? ?/sec
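A hedged sketch of the early-out shape (not the PR's exact code): the tick path reports an empty queue immediately so run can go straight back to polling the main future instead of setting up the waiting machinery.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;

use async_task::Runnable;

struct State {
    queue: RefCell<VecDeque<Runnable>>,
}

impl State {
    fn tick(&self) -> bool {
        // The borrow is released at the end of this statement, so the task
        // may safely re-enter the executor (e.g. to spawn) when it runs.
        let runnable = self.queue.borrow_mut().pop_front();
        match runnable {
            None => false, // early-out: nothing scheduled
            Some(r) => {
                r.run();
                true
            }
        }
    }
}
```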

@notgull (Member) commented Aug 13, 2025

@james7132 Can you rebase on master?

@james7132 force-pushed the optimized-local-executor branch from 13fbe6f to c4f077f on August 13, 2025 09:34
@james7132 (Contributor, Author)

> @james7132 Can you rebase on master?

Done!

#[inline]
pub(crate) fn state_ptr(&self) -> *const State {
    #[cold]
    fn alloc_state(cell: &Cell<*mut State>) -> *mut State {
Member

Why do we delay the allocation of the state? How beneficial is this from a performance perspective, given that it trades the chance to optimize away the state allocation entirely (in case the LocalExecutor gets destroyed without use) for a branch on every state access (minus the few where the compiler can infer that the state is non-null / initialized)?

@james7132 (Contributor, Author) commented Aug 13, 2025

It allows LocalExecutor::new to be const; without it, this would be a breaking change and would require a major version bump.

Executor has this as well for similar reasons, using an AtomicPtr instead of a Cell.
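To make the tradeoff concrete, a minimal sketch consistent with the hunk above (the real State fields and the Drop impl that frees the allocation are omitted):

```rust
use std::cell::Cell;
use std::ptr;

struct State {} // stand-in; the real State holds the queue and active tasks

pub struct LocalExecutor {
    state: Cell<*mut State>,
}

impl LocalExecutor {
    // Cell::new(ptr::null_mut()) works in const context, so `new` stays
    // const even though State itself cannot be built in one.
    pub const fn new() -> LocalExecutor {
        LocalExecutor {
            state: Cell::new(ptr::null_mut()),
        }
    }

    pub(crate) fn state_ptr(&self) -> *const State {
        let mut p = self.state.get();
        if p.is_null() {
            p = alloc_state(&self.state); // cold path, taken at most once
        }
        p
    }
}

#[cold]
fn alloc_state(cell: &Cell<*mut State>) -> *mut State {
    // First use: allocate the state and stash the raw pointer in the cell.
    let p = Box::into_raw(Box::new(State {}));
    cell.set(p);
    p
}
```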

@james7132 (Contributor, Author)

That said, this comment makes me realize that the State in both executors satisfies all the safety invariants needed to make the state a Pin<Box<State>> instead of the reference-counted types, so we can avoid the reference counting even without the 'static borrows of the Static variants.

As far as I know this holds true even for the instances of either that get leaked into their Static variants.
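A hedged sketch of that idea (names illustrative): the executor owns the state in a Pin<Box<_>> and hands out Pin<&State>, which is sound only because the executor, or the leaked Static variant, strictly outlives every task borrowing it.

```rust
use std::pin::Pin;

struct State {} // stand-in for the executor state

struct LocalExecutor {
    // Owned and pinned: the State never moves, so tasks can hold
    // Pin<&State> instead of cloning an Rc.
    state: Pin<Box<State>>,
}

impl LocalExecutor {
    fn state(&self) -> Pin<&State> {
        self.state.as_ref()
    }
}
```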

Member

As far as I'm aware, this sounds correct.

Member

Okay, for the next major version bump it would be nice to pursue a version which doesn't use the internal Cell wrapping, at least for the LocalExecutor version, where "difficulty of sharing across threads" isn't that much of a concern.

@james7132 (Contributor, Author)

Went ahead and implemented the pin change, and filed #146 for the non-local version.

@james7132 force-pushed the optimized-local-executor branch from 2b5adca to d0a1f9e on August 14, 2025 00:01
fn spawn_inner<T: 'a>(
    state: Pin<&'a State>,
    future: impl Future<Output = T> + 'a,
    active: &mut Slab<Waker>,
@fogti (Member) commented Aug 14, 2025

This function has one aspect which bothers me:

It gets passed an active: &mut Slab<Waker>, but in its body it also constructs additional unsafe { &mut *state.active.get() } references, violating the "exclusive / strictly nested" property of &mut references. This is undefined behavior both because it violates Rust's memory aliasing model and because rustc annotates &mut arguments as noalias, which can cause LLVM to deduce wrong properties about the function.

The place where such an additional bad reference is constructed is the closure for future below, which would be invoked, e.g., in the case of a panic in the function.
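Reduced to a sketch (assumed names; the real code derives the second reference inside the future's cleanup closure):

```rust
use std::cell::UnsafeCell;
use std::task::Waker;

use slab::Slab;

struct State {
    active: UnsafeCell<Slab<Waker>>,
}

fn spawn_inner(state: &State, active: &mut Slab<Waker>) {
    let cleanup = move || {
        // A second &mut to the same Slab, derived while `active` may still
        // be live in the caller: UB under Rust's aliasing rules, and rustc
        // marks &mut arguments noalias for LLVM.
        let active2 = unsafe { &mut *state.active.get() };
        let _ = active2;
    };
    // If `cleanup` can run (e.g. during unwinding) while `active` is still
    // borrowed, the two exclusive references overlap.
    let _ = (active, cleanup);
}
```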

Member

I know this problem also exists for the normal Executor::spawn_inner, but in this case it was more plainly visible.

// Remove the task from the set of active tasks when the future finishes.
let entry = active.vacant_entry();
let index = entry.key();
let future = AsyncCallOnDrop::new(future, move || {
Member

It would be nice if Builder::new().propagate_panic(true) could be saved in a binding above this line.

It is not really necessary, but it makes it obvious that no panics can occur between the construction of AsyncCallOnDrop and the call to .spawn_unchecked, at which point the handling of drop unwinding gets taken over by Task, which automatically schedules itself in that case and notifies this LocalExecutor.

Note that entry.insert might allocate, and subsequently panic on OOM.
I think the easiest way to prevent panics in this function in a sort-of "critical section" w.r.t. state.active would be to invoke active.reserve(1); at the top of the function.

@james7132 (Contributor, Author) commented Aug 14, 2025

Really good catch! Unwinds continue to be the one control-flow path that's easy to miss.

I went ahead and moved the builder outside of the "critical section".

reserve addresses the allocator OOM from the Slab, but an OOM may also occur in schedule with the VecDeque, and reserve adds extra work on the hot path. I'd rather use a more robust / zero-cost method, so I opted to put an AbortOnPanic guard around the critical section instead. Rust's drop order should force it to run before the AsyncCallOnDrop is dropped.
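A minimal sketch of such a guard, assuming a hand-rolled AbortOnPanic (the PR may define it differently): while the guard is live, any unwind aborts the process instead of running Drop code that could observe half-updated executor state.

```rust
struct AbortOnPanic;

impl Drop for AbortOnPanic {
    fn drop(&mut self) {
        // Only reached during an unwind, because the happy path below
        // forgets the guard.
        std::process::abort();
    }
}

fn critical_section() {
    let guard = AbortOnPanic;
    // ... insert into the active Slab, push to the queue, etc. ...
    std::mem::forget(guard); // disarm: no panic occurred
}
```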

Co-authored-by: Ellen Emilia Anna Zscheile <[email protected]>
@fogti (Member) left a comment

Overall LGTM, but as this is a complicated change, it should still be reviewed by others.

cc @smol-rs/admins
